Global Convergence of Gradient Descent for Deep Linear Residual Networks

Neural Information Processing Systems

We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization: zero-asymmetric (ZAS) initialization. It is motivated by avoiding stable manifolds of saddle points. We prove that under the ZAS initialization, for an arbitrary target matrix, gradient descent converges to an $\varepsilon$-optimal point in $O\left( L^3 \log(1/\varepsilon) \right)$ iterations, which scales polynomially with the network depth $L$. Our result and the $\exp(\Omega(L))$ convergence time for the standard initialization (Xavier or near-identity) \cite{shamir2018exponential} together demonstrate the importance of the residual structure and the initialization in the optimization for deep linear neural networks, especially when $L$ is large.
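As a rough illustration of the setting, the sketch below runs gradient descent on a scalar (width-1) deep linear residual model $u \prod_{l=1}^{L} (1 + w_l)$ fitted to a single target value. The abstract does not spell out the ZAS initialization or the model parameterization, so the sketch assumes that the output weight `u` and every residual weight `w_l` start at zero. The function name, the scalar model, the learning rate, and the step count are all hypothetical choices made for illustration; the sketch does not reproduce the paper's matrix setting and does not check the $O(L^3 \log(1/\varepsilon))$ bound.

```python
def train_scalar_residual(target, depth, lr, steps):
    """Gradient descent on 0.5 * (u * prod(1 + w_l) - target)**2.

    Assumption (not stated in the abstract): every parameter starts at zero,
    i.e. u = 0 and w_l = 0 for all residual blocks.
    """
    u = 0.0
    w = [0.0] * depth
    losses = []
    for _ in range(steps):
        # Forward pass: product of the residual blocks (1 + w_l).
        prod = 1.0
        for wl in w:
            prod *= 1.0 + wl
        residual = u * prod - target
        losses.append(0.5 * residual ** 2)

        # Gradients of the squared loss with respect to u and each w_l.
        # d prod / d w_l = prod / (1 + w_l), valid while 1 + w_l != 0.
        grad_u = residual * prod
        grad_w = [residual * u * prod / (1.0 + wl) for wl in w]

        # Plain gradient descent step.
        u -= lr * grad_u
        w = [wl - lr * g for wl, g in zip(w, grad_w)]
    return u, w, losses


if __name__ == "__main__":
    u, w, losses = train_scalar_residual(target=2.0, depth=5, lr=0.01, steps=2000)
    print("initial loss:", losses[0])
    print("final loss:", losses[-1])
```

At this starting point the gradient with respect to each `w_l` is zero on the first step because `u` is zero, while the gradient with respect to `u` is not, so `u` moves first and the residual weights follow.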


Reviews: Global Convergence of Gradient Descent for Deep Linear Residual Networks

Neural Information Processing Systems

Response to authors' feedback: I thank the authors for the rebuttal. My score remains the same.

With this initialization, the networks are shown to converge linearly to zero loss, under conditions (for discrete-time GD) that are different from, and perhaps conceptually simpler than, those in previous works. For instance, compared to reference [2] (Arora et al., "A convergence analysis of gradient descent for deep linear neural networks", ICLR 2019), this work completely removes the delta-balanced condition in [2] by showing that this condition actually holds, for most layers, on the GD trajectory (Lemma 4.2 and Eq. While certain elements have already appeared in previous works (e.g., the property in Lemma 4.2 is similar to the delta-balanced condition in [2], and the requirement of zero initialization for the last layer's weight appears in the "fixup initialization" of reference [21] in the context of residual networks), I think the proposed initialization and the convergence analysis here deserve credit for novelty.


Reviews: Global Convergence of Gradient Descent for Deep Linear Residual Networks

Neural Information Processing Systems

The reviewers appreciated the work on the initialization, even though they deemed it incremental. The experiments on the nonlinear network in the rebuttal were useful, and I encourage the authors to expand the experimental section with more realistic setups to show how the theory matters in practice.



